Authors: Sam Coleman and Jacob Tye
This project explores the application of machine learning techniques to DNA methylation data from The Cancer Genome Atlas (TCGA) and cBioPortal.
The primary objectives are:
The analyses utilize publicly available data:
Data preparation scripts are located in the analysis/data_prep directory.
Note: The data is not included in the GitHub repository because of GitHub file size limits. To obtain the data, see the analysis/data_prep directory; the samples used are listed in data/methylation/all_samples_450K.tsv.
1.1 Locally Linear Embedding with Logistic Regression
Implementation: analysis/01_1_dim_red_prediction.ipynb
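As a rough illustration of the approach in 1.1, the sketch below chains locally linear embedding and logistic regression with scikit-learn. It is not the notebook's code: the synthetic matrix stands in for methylation data, and the neighbor count, number of components, and the scaling step are all assumptions chosen only so the example runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a (samples x CpG probes) matrix; NOT the TCGA data.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical hyperparameters; the notebook may use different values.
lle_logreg = Pipeline([
    ("lle", LocallyLinearEmbedding(n_neighbors=10, n_components=5,
                                   eigen_solver="dense")),
    # LLE coordinates are very small in magnitude, so rescale them
    # before the regularized logistic regression (an assumption here).
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
lle_logreg.fit(X_train, y_train)

# Embed held-out samples with the fitted LLE step, then score the pipeline.
embedded_test = lle_logreg.named_steps["lle"].transform(X_test)
accuracy = lle_logreg.score(X_test, y_test)
print(f"embedding shape: {embedded_test.shape}, test accuracy: {accuracy:.3f}")
```

Wrapping both steps in one Pipeline keeps the embedding fitted on training samples only, so the test samples never influence the reduced representation.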
SHAP Troubleshooting
1.1.1 No Dimensionality Reduction
analysis/shap_memory_troubleshoot/01_logistic_regressor.ipynb
1.1.2 HM27 Data
1.2 Principal Component Analysis with Gradient Boosting
analysis/01_2_gradient_boosting.ipynb
Both models achieved over 97% accuracy on the testing dataset.
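The PCA plus gradient boosting approach of 1.2 might look roughly like the following sketch. It assumes scikit-learn's PCA and GradientBoostingClassifier; the synthetic data, the number of components, and the number of estimators are illustrative placeholders, not values taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for tumor/normal methylation profiles; NOT the TCGA data.
X_gb, y_gb = make_classification(
    n_samples=400, n_features=100, n_informative=15, random_state=0
)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_gb, y_gb, test_size=0.25, stratify=y_gb, random_state=0
)

# Hypothetical settings: 10 principal components, 50 boosting stages.
pca_gb = Pipeline([
    ("pca", PCA(n_components=10, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])
pca_gb.fit(Xg_train, yg_train)

# Evaluate on the held-out split.
gb_predictions = pca_gb.predict(Xg_test)
gb_accuracy = accuracy_score(yg_test, gb_predictions)
print(f"test accuracy on synthetic data: {gb_accuracy:.3f}")
```

The accuracy printed here reflects only the toy data and says nothing about the 97% figure reported for the real dataset.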
2.1 Unsupervised:
K-Means clustering, resulting in a normalized mutual information (NMI) score of 0.2902.
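To show how an NMI score for K-Means clusters can be computed, here is a minimal sketch using scikit-learn's KMeans and normalized_mutual_info_score. The blob data, the choice of three clusters, and the variable names are assumptions for illustration; the notebook's preprocessing and cluster count may differ.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Synthetic, well-separated groups standing in for subtypes; NOT the TCGA data.
X_blobs, true_subtype = make_blobs(
    n_samples=300, centers=3, n_features=20, cluster_std=1.0, random_state=0
)

# Cluster without looking at the labels (hypothetical k = 3).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_blobs)

# NMI compares the two labelings and ignores how the cluster ids are numbered.
kmeans_nmi = normalized_mutual_info_score(true_subtype, cluster_ids)
print(f"NMI on synthetic blobs: {kmeans_nmi:.4f}")
```

On cleanly separated toy data like this the score should come out close to 1; a value such as 0.2902 indicates that the clusters found overlap only weakly with the annotated subtypes.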
analysis/02_1_kmeans_clustering.ipynb
2.2 Supervised:
Neural network following feature dimensionality reduction, achieving an NMI score of 0.58.
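A supervised counterpart can be sketched in the same way. The example below uses PCA for the dimensionality reduction and scikit-learn's MLPClassifier as a stand-in network, then scores the predicted labels against the true labels with NMI. The reduction method, the network architecture, and the four synthetic classes are all assumptions; the notebook's actual model is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class problem standing in for subtypes; NOT the TCGA data.
X_nn, subtype = make_classification(
    n_samples=400, n_features=60, n_informative=12,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
Xn_train, Xn_test, sub_train, sub_test = train_test_split(
    X_nn, subtype, test_size=0.25, stratify=subtype, random_state=0
)

# Hypothetical pipeline: scale, reduce to 10 components, one hidden layer.
pca_mlp = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10, random_state=0)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=0)),
])
pca_mlp.fit(Xn_train, sub_train)

# Score predicted subtype labels against the held-out true labels.
predicted_subtype = pca_mlp.predict(Xn_test)
mlp_nmi = normalized_mutual_info_score(sub_test, predicted_subtype)
print(f"NMI on synthetic data: {mlp_nmi:.3f}")
```

Reporting NMI for the classifier as well makes its result directly comparable with the K-Means score in 2.1.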
analysis/02_02_neural_network.ipynb
The models effectively differentiate tumors from normal cells, indicating the potential of integrating methylation data with machine learning for early detection and diagnosis.
The lower performance in subtype classification suggests epigenetic heterogeneity within current categorization systems, highlighting opportunities for refinement.
The project presentation is in final_project_presentation.pptx
The full report for this project is in ML_Project_Report.pdf
Cerami et al. "The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data." Cancer Discovery, May 2012.
Gao et al. "Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal." Sci. Signal., 2013.
de Bruijn et al. "Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal." Cancer Res, 2023.
The results presented are based on data generated by the TCGA Research Network: https://www.cancer.gov/tcga.